Home Credit — EDA & Modeling

Authors

Whitney Bullock · Thomas Beck · Caleb Call

Published

November 19, 2025

Loaded Libraries:

[tidyverse, skimr, DataExplorer, corrplot, naniar, caret, tidymodels, xgboost, kableExtra, Matrix, VennDiagram, grid, rpart, rpart.plot, pROC, randomForest, ranger, knitr, broom, foreach, parallel, scales, Ckmeans.1d.dp]

Business Problem

Home Credit’s operational model is fundamentally about financial inclusion. Loans are provided to populations with little or no credit history—groups often rejected by traditional banks. This creates a risk assessment challenge: evaluating repayment likelihood without standard credit scores.

The core issue is information asymmetry. Strict approval criteria exclude creditworthy applicants, failing the corporate mission. Lenient criteria result in financial losses from defaults.

The goal is not merely prediction accuracy, as only 8% of the population defaults. A basic model could achieve 92% accuracy simply by approving every applicant, but this fails to manage risk. Instead, applicants must be ranked by default likelihood using AUC-ROC. This approach identifies specific “invisible risk” cases for further review.

Methodology

The standard CRISP-DM framework structures the analysis:

  • Data Preparation: Missing data is handled (imputing EXT_SOURCE_1 with the median), empty columns are removed, and features are engineered to capture age and employment length in years.
  • Exploratory Data Analysis (EDA): The 92/8 class imbalance is verified, and external credit scores are identified as the strongest predictors.
  • Modeling: A progression of models is built, beginning with Logistic Regression for a baseline, moving to Decision Trees for interpretability, and finally using Random Forest and XGBoost for maximum predictive power.
  • Evaluation: AUC-ROC is the primary metric used to select the champion model for risk ranking.

Data Preparation & Feature Engineering

The raw data is imported and a summary of dataset dimensions is generated.

Dataset Dimensions
dataset n_rows n_cols
train 307,511 122
test 48,744 121

Missing values for EXT_SOURCE_1 are filled using the median to preserve signal, and columns with greater than 50% missingness are dropped.

Data inconsistencies are corrected: negative day counts (e.g., DAYS_BIRTH) are converted to positive years, and the sentinel value in DAYS_EMPLOYED is treated as missing before conversion.
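The cleaning steps above can be sketched with dplyr. This is a sketch rather than the report's exact code: column names follow the Home Credit application table, and 365243 is the well-known placeholder value in DAYS_EMPLOYED.

```r
# Sketch of the preparation pipeline described above (assumed column names).
library(dplyr)

clean_app <- function(df) {
  df %>%
    mutate(
      # Impute EXT_SOURCE_1 with its median to preserve signal
      EXT_SOURCE_1_IMPUTED = if_else(
        is.na(EXT_SOURCE_1),
        median(EXT_SOURCE_1, na.rm = TRUE),
        EXT_SOURCE_1
      ),
      # Convert negative day counts to positive years
      AGE_YEARS = -DAYS_BIRTH / 365.25,
      # Treat the 365243-day sentinel in DAYS_EMPLOYED as missing
      DAYS_EMPLOYED_FIXED = if_else(
        DAYS_EMPLOYED == 365243, NA_real_, as.numeric(DAYS_EMPLOYED)
      ),
      YEARS_EMPLOYED = -DAYS_EMPLOYED_FIXED / 365.25
    ) %>%
    # Drop columns with more than 50% missingness
    select(where(~ mean(is.na(.x)) <= 0.5))
}
```

Imputing before the missingness filter matters here: EXT_SOURCE_1 itself exceeds the 50% threshold, so only the imputed copy survives the drop.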

Exploratory Data Analysis

Target Distribution

Target Distribution
class count pct
Repaid (0) 282,686 0.919
Default (1) 24,825 0.081

As expected, the data is heavily skewed. 91.93% of applicants repay their loans, while only 8.07% default. This reinforces why “accuracy” is an insufficient metric—focus must be placed on identifying the 8% minority class.
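The class-balance check behind this table is a one-liner; the "approve everyone" baseline accuracy equals the repayment rate, which is exactly why accuracy is uninformative here (a sketch, assuming a `train` data frame with a 0/1 `TARGET` column):

```r
# Class distribution and the trivial majority-class baseline.
library(dplyr)

train %>%
  count(TARGET, name = "count") %>%
  mutate(pct = count / sum(count))

# Predicting "repaid" for every applicant scores ~0.92 accuracy
# while identifying zero defaults.
baseline_accuracy <- mean(train$TARGET == 0)
```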

Key Categorical Relationships

Categories with the highest default rates are analyzed to identify areas of risk. The faceted plot below highlights the specific attributes driving default risk across multiple categories.

Income source displays significant risk variance, with unemployed applicants and those on maternity leave exhibiting the highest default rates.

Numeric Correlations

Top Numeric Correlations
Variable Correlation
EXT_SOURCE_3 -0.179
EXT_SOURCE_2 -0.160
EXT_SOURCE_1_IMPUTED -0.099
DAYS_BIRTH 0.078
AGE_YEARS -0.078
DAYS_EMPLOYED_FIXED 0.075
YEARS_EMPLOYED -0.075
REGION_RATING_CLIENT_W_CITY 0.061
REGION_RATING_CLIENT 0.059
DAYS_LAST_PHONE_CHANGE 0.055

External source scores are the strongest predictors (negative correlation indicates higher scores correspond to lower risk). Age and employment history also provide significant predictive power.
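The ranking above is a point-biserial correlation of each numeric column against the binary target, sorted by magnitude. A minimal sketch (assuming the `train` data frame used earlier):

```r
# Correlation of every numeric feature with TARGET, strongest first.
library(dplyr)
library(tibble)

num_cols <- train %>% select(where(is.numeric))

cors <- sapply(
  num_cols %>% select(-TARGET),
  function(x) cor(x, num_cols$TARGET, use = "pairwise.complete.obs")
)

cor_table <- tibble(Variable = names(cors), Correlation = cors) %>%
  arrange(desc(abs(Correlation)))

head(cor_table, 10)
```

Sorting by absolute value is deliberate: the EXT_SOURCE scores are negatively correlated with default, and a sort on raw values would bury the strongest predictors at the bottom.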

AI-Assisted Feature Prioritization

The column list was submitted to an AI credit analyst to generate a domain-based prioritization of predictors.


Prompt Sent to AI

You are a credit risk analyst. Below are the column names from my dataset. Based only on the meaning of the variables (no data yet), list the top 15–20 variables you think will be most predictive of the binary target TARGET (1 = default/bad, 0 = good). For each chosen variable give one sentence explaining why you expect it to be predictive. Also list any variables you believe are unlikely to be predictive and why. Finally, identify obvious interactions or non-linear relationships that should be considered later in modeling. Column names: SK_ID_CURR, TARGET, … [Truncated list for report]

AI response

  • EXT_SOURCE_3 — external credit-scoring features are usually highly predictive because they synthesize credit bureau/third-party signals about borrower risk.
  • EXT_SOURCE_2 — an additional external score often provides complementary predictive information to EXT_SOURCE_3.
  • EXT_SOURCE_1 — another external score; together the three EXT_SOURCE variables commonly dominate predictive power.
  • DAYS_BIRTH — borrower age (days) correlates with experience, income stability and default behavior (often non-linear).
  • AMT_CREDIT — loan amount is directly related to repayment burden and default probability (higher credit often higher risk ceteris paribus).
  • AMT_INCOME_TOTAL — household income is fundamental to repayment capacity and moderates other financial ratios.
  • AMT_ANNUITY — monthly/annual payment size relative to income indicates stress; high annuity-to-income increases default risk.
  • AMT_GOODS_PRICE — loan purpose/size proxy; when large relative to income it raises risk.
  • DAYS_EMPLOYED — employment tenure (and special sentinel values) captures job stability; short/irregular employment increases risk.
  • CNT_CHILDREN — dependents increase household obligations and can reduce repayment capacity.
  • CNT_FAM_MEMBERS — household size changes per-capita income dynamics and consumption burden.
  • NAME_INCOME_TYPE — income source (pension, state, working, business) strongly affects stability and default propensity.
  • NAME_EDUCATION_TYPE — education proxies for earning capacity and financial literacy, correlating with risk.
  • NAME_HOUSING_TYPE — housing status (own, mortgage, rent) proxies wealth/stability and collateral availability.
  • OCCUPATION_TYPE — job category captures industry/role risk and stability (e.g., seasonal work).
  • DAYS_ID_PUBLISH — recency of identity issuance may capture recency of life events / identity stability and sometimes correlates with fraud/risk.
  • REGION_RATING_CLIENT / REGION_RATING_CLIENT_W_CITY — regional socio-economic indicators capture local unemployment/credit environment (either one can be chosen).
  • OBS_60_CNT_SOCIAL_CIRCLE — observations in social circle may indicate social-financial exposures; high observation counts can be informative of shared risk.
  • DEF_60_CNT_SOCIAL_CIRCLE — defaults observed in social circle are plausible contagion indicators of risk.
  • AMT_REQ_CREDIT_BUREAU_YEAR (also MON/QRT/WEEK) — recent credit inquiries reflect active credit shopping and short-term credit stress; year aggregate is often most informative.

Note: I focused on variables that (by domain knowledge) capture borrower ability to pay, historical credit signals, external bureau scores, and household obligations.

Variables likely to be weak or require special handling

  • SK_ID_CURR — identifier only, no predictive content (drop or use for joins).
  • FLAG_MOBIL / many FLAG_DOCUMENT_ fields — these are often almost-constant (e.g., FLAG_MOBIL = 1 for most) and have low variance; include only if their distribution shows signal.
  • WEEKDAY_APPR_PROCESS_START / HOUR_APPR_PROCESS_START — could be spurious unless application-time correlates with channel/fraud; usually weak.
  • Highly aggregated census-style _MODE, _MEDI, _AVG — could be useful, but many are derived from applicant’s neighborhood and can be redundant with REGION_; treat carefully.
  • BASEMENTAREA_ / COMMONAREA_ etc. — building-level averages may be noisy or frequently missing; predictive only if well-populated and aligned with applicant address.
  • FLAG_EMAIL / FLAG_PHONE — if nearly all are 1/0 they carry little info; but missingness patterns might be informative.

Suggested interactions & non-linearities to consider

  • Annuity / Income interaction: AMT_ANNUITY / AMT_INCOME_TOTAL (annuity-to-income ratio) — expected strong non-linear effect.
  • Credit / Income / Age triad: AMT_CREDIT × AMT_INCOME_TOTAL × DAYS_BIRTH — same loan may be riskier for younger/low-income customers.
  • EXT_SOURCEs combined: non-linear combination or monotonic binning of EXT_SOURCE_1/2/3 often beats single-score use.
  • Employment & Income Type: DAYS_EMPLOYED × NAME_INCOME_TYPE — short employment may be more severe for wage-earners than for pensioners.
  • Region × Housing: REGION_RATING_CLIENT × NAME_HOUSING_TYPE — regional economic conditions interact with housing tenure to change risk.
  • Counts extreme effects: CNT_CHILDREN, CNT_FAM_MEMBERS may show non-linear (both 0 and very large values could behave differently); consider binning.
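The ratio interactions suggested above translate directly into engineered columns. A minimal dplyr sketch — the derived names are illustrative, the source columns come from the application table:

```r
# Hypothetical ratio features based on the AI's interaction suggestions.
library(dplyr)

train <- train %>%
  mutate(
    # Payment burden relative to income (annuity-to-income ratio)
    ANNUITY_INCOME_RATIO = AMT_ANNUITY / AMT_INCOME_TOTAL,
    # Loan size relative to income
    CREDIT_INCOME_RATIO  = AMT_CREDIT / AMT_INCOME_TOTAL
  )
```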

Our Response to AI Generation

The AI correctly prioritized “external source” scores and “age” as top predictors, aligning with the empirical EDA findings. It also highlighted AMT_CREDIT as a key variable. However, data analysis revealed that city ratings and the recency of phone changes were also significant, which the AI did not emphasize. This suggests that domain knowledge must be combined with actual data inspection for a complete picture.

Venn Diagram

(Figure: Venn diagram comparing the AI-suggested predictors with the predictors confirmed empirically by the EDA.)

EDA Results

The exploratory data analysis confirms that external scores are the dominant predictors. Empirical evidence also points to age, employment duration, and phone change recency as valuable signals. These insights are critical for understanding the customer base and refining the modeling strategy.

Model Analysis

The modeling phase establishes a baseline against the 91.93% repayment rate. The goal is not raw accuracy but maximizing the AUC-ROC score so that the 8% of applicants who default are ranked correctly.

A progression of models was built: Logistic Regression, Decision Tree, Random Forest, and Gradient Boosting. A 70:30 train/test partition ensures evaluation on unseen data.
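With caret (loaded above), a stratified 70:30 split preserving the 92/8 class ratio in both partitions can be sketched as follows (the seed and object names are illustrative):

```r
# Stratified 70:30 split; createDataPartition keeps the class
# proportions of TARGET roughly equal in both partitions.
library(caret)

set.seed(123)
idx       <- createDataPartition(train$TARGET, p = 0.70, list = FALSE)
train_set <- train[idx, ]
test_set  <- train[-idx, ]
```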

Logistic Regression Model

Logistic Regression establishes a baseline using the key predictors identified earlier: EXT_SOURCE variables, AGE_YEARS, and YEARS_EMPLOYED.

Model 1 (Key Predictors) AUC: 0.7299 
Model 2 (Gender)         AUC: 0.5464 
Model 3 (Income)         AUC: 0.5230 

Logistic Regression Model Analysis

Model 1, utilizing the key predictors, performed reasonably well. Models relying solely on gender or income yielded AUC scores near 0.5, indicating no predictive power. This confirms that demographic variables like gender and income are insufficient on their own.
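Model 1 can be sketched with `glm` and pROC; the predictor names follow the engineered features described earlier, and `train_set`/`test_set` are assumed to come from a 70:30 split:

```r
# Baseline logistic regression on the key predictors, scored by AUC.
library(pROC)

logit_fit <- glm(
  TARGET ~ EXT_SOURCE_2 + EXT_SOURCE_3 + EXT_SOURCE_1_IMPUTED +
    AGE_YEARS + YEARS_EMPLOYED,
  data = train_set, family = binomial
)

# Predicted default probabilities on held-out data
probs <- predict(logit_fit, newdata = test_set, type = "response")
auc(roc(test_set$TARGET, probs))  # the report obtains ~0.73 here
```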

Decision Tree Model

A Decision Tree was implemented to handle non-linearities and outliers. To address class imbalance, a 5:1 penalty weight was applied to the minority class (default), forcing the model to prioritize identifying high-risk applicants.
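In rpart, the 5:1 penalty is expressed through a loss matrix: the cost of missing a default (predicting "repaid" when the truth is "default") is set to 5, versus 1 for a false alarm. A sketch under the same assumed split and predictors:

```r
# Weighted decision tree: loss matrix rows are true classes (0, 1),
# columns are predicted classes; L[2,1] = 5 penalizes missed defaults 5x.
library(rpart)
library(rpart.plot)

tree_fit <- rpart(
  factor(TARGET) ~ EXT_SOURCE_2 + EXT_SOURCE_3 + EXT_SOURCE_1_IMPUTED +
    AGE_YEARS + YEARS_EMPLOYED,
  data = train_set, method = "class",
  parms = list(loss = matrix(c(0, 5, 1, 0), nrow = 2))
)

rpart.plot(tree_fit)  # visualizes the flagging rules
```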

Decision Tree Model Analysis

The Decision Tree achieved an AUC of 0.6853. While slightly lower than the Logistic model, it provides clear rules for identifying high-risk groups (e.g., applicants with very low external scores). These rules can be used to automatically flag applications.

Random Forest Model

A Random Forest ensemble of 500 trees was trained to reduce overfitting and improve stability compared to the single decision tree.
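The forest is trained with ranger (fast enough for ~300k rows), which also produces the impurity-based importance scores behind the variable importance plot. A sketch, assuming `TARGET` has been converted to a factor:

```r
# 500-tree probability forest with impurity-based variable importance.
library(ranger)

rf_fit <- ranger(
  factor(TARGET) ~ ., data = train_set,
  num.trees  = 500,
  probability = TRUE,          # return class probabilities for AUC
  importance  = "impurity"     # Gini impurity reduction per feature
)

# Top features by total impurity reduction across the 500 trees
sort(rf_fit$variable.importance, decreasing = TRUE)[1:10]
```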


A variable importance plot illustrates which features were most influential in reducing impurity across the 500 trees.

Random Forest Model Analysis

The Random Forest yielded an AUC of 0.651. This performance was lower than the weighted Decision Tree, likely due to the lack of class weighting in this implementation. However, the importance plot confirms that EXT_SOURCE variables are overwhelmingly the most critical decision factors.

Gradient Boosting Model

Gradient Boosting (XGBoost) was implemented. This is an ensemble method that builds trees sequentially to correct errors from previous trees, making it robust for complex cases.
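Training can be sketched with the xgboost R API; the hyperparameters below are illustrative, and `X_train`/`X_test` are assumed to be numeric design matrices (XGBoost requires one-hot or otherwise numeric inputs):

```r
# Sequential boosting: each round fits trees to the gradient of the
# logistic loss on the previous rounds' residual errors.
library(xgboost)

dtrain <- xgb.DMatrix(data = X_train, label = y_train)
dtest  <- xgb.DMatrix(data = X_test,  label = y_test)

xgb_fit <- xgb.train(
  params = list(objective = "binary:logistic", eval_metric = "auc",
                eta = 0.1, max_depth = 6),
  data = dtrain, nrounds = 200,
  watchlist = list(test = dtest), verbose = 0
)

# Feature Gain: each feature's contribution to loss reduction
imp <- xgb.importance(model = xgb_fit)
xgb.plot.importance(imp, top_n = 10)
```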

The Feature Gain plot below shows the relative contribution of each feature to the model’s accuracy.

Gradient Boosting Model Analysis

XGBoost achieved a test AUC of 0.742, outperforming all other models. This result confirms its suitability for risk ranking. The Feature Gain plot highlights that EXT_SOURCE_3 and EXT_SOURCE_2 are dominant, but DAYS_EMPLOYED and AGE_YEARS also provide specific high-value splits.

Kaggle Submission

The final XGBoost model is applied to the official test data to generate the submission file.
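Scoring and export reduce to a predict-and-write step; a sketch, assuming `xgb_fit` is the trained booster, `X_test_kaggle` is the numeric matrix built from the official test file, and `test` retains the applicant IDs:

```r
# Generate the two-column Kaggle submission: applicant ID plus
# predicted default probability from the champion model.
library(readr)
library(tibble)
library(xgboost)

submission <- tibble(
  SK_ID_CURR = test$SK_ID_CURR,
  TARGET     = predict(xgb_fit, xgb.DMatrix(X_test_kaggle))
)

write_csv(submission, "submission.csv")
```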

Final Submission Preview
SK_ID_CURR TARGET
100,001 0.161
100,005 0.569
100,013 0.181
100,028 0.242
100,038 0.635
100,042 0.619
100,057 0.115
100,065 0.233
100,066 0.068
100,067 0.371

The submission achieved a Private score of 0.70126 and a Public score of 0.70297 on Kaggle, indicating strong predictive performance.

Conclusion

The analysis identified external credit scores as the most robust predictors of default risk. Higher scores strongly correlate with lower risk. Age and employment history provide secondary predictive value, while demographic factors like gender and income show negligible utility in isolation.

The Gradient Boosting model (XGBoost) demonstrated superior performance (AUC 0.742) compared to Logistic Regression, Decision Trees, and Random Forests. It is recommended that Home Credit utilize this model to prioritize the riskiest applicants (e.g., the top 10%) for manual review, thereby optimizing resource allocation and minimizing potential defaults.